Sparse Coding on Local Spatial-Temporal Volumes for Human Action Recognition
Abstract
Many recently proposed approaches to action recognition achieve promising performance by extracting local spatial-temporal features from videos. These approaches commonly use the Bag-of-Words (BoW) model to obtain video-level representations. However, the BoW model assigns each feature vector to its single closest visual word, which inevitably causes nontrivial quantization errors and limits further improvements in classification rates. To obtain a more accurate and discriminative representation, in this paper we propose an approach to action recognition that encodes local 3D spatial-temporal gradient features within the sparse coding framework. Each local spatial-temporal feature is thereby transformed into a linear combination of a few “atoms” from a trained dictionary. In addition, we investigate the construction of the dictionary under the guidance of transfer learning. We collect a large set of diverse video clips of sports games and movies, from which the universal atoms that make up the dictionary are learned by an online learning strategy. We test our approach on the KTH dataset and the UCF sports dataset. Experimental results demonstrate that our approach outperforms state-of-the-art techniques on the KTH dataset and achieves comparable performance on the UCF sports dataset.
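The contrast between BoW hard assignment and sparse coding can be illustrated with a minimal sketch. It is not the paper's implementation: the dictionary here is random rather than learned online from video clips, the descriptor is a random vector rather than a 3D gradient feature, and plain matching pursuit is swapped in as a simple greedy stand-in for whatever sparse solver the authors use. The names `hard_assign` and `matching_pursuit` are hypothetical helpers introduced only for this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def hard_assign(x, dictionary):
    """BoW-style coding: index of the nearest atom and the quantization error."""
    errors = []
    for d in dictionary:
        diff = [a - b for a, b in zip(x, d)]
        errors.append(math.sqrt(dot(diff, diff)))
    j = min(range(len(dictionary)), key=errors.__getitem__)
    return j, errors[j]

def matching_pursuit(x, dictionary, n_steps=3):
    """Greedy sparse coding: x ~ sum_j code[j] * dictionary[j], few nonzeros.

    Assumes every atom in `dictionary` has unit L2 norm.
    """
    residual = list(x)
    code = [0.0] * len(dictionary)
    for _ in range(n_steps):
        # Pick the atom most correlated with what is still unexplained.
        corr = [dot(residual, d) for d in dictionary]
        j = max(range(len(dictionary)), key=lambda i: abs(corr[i]))
        code[j] += corr[j]
        residual = [r - corr[j] * a for r, a in zip(residual, dictionary[j])]
    return code, math.sqrt(dot(residual, residual))

# Toy data (assumption): random unit-norm atoms and one random descriptor.
random.seed(0)
dim, n_atoms = 8, 16
dictionary = [normalize([random.gauss(0, 1) for _ in range(dim)])
              for _ in range(n_atoms)]
x = normalize([random.gauss(0, 1) for _ in range(dim)])

word, bow_error = hard_assign(x, dictionary)
code, sc_error = matching_pursuit(x, dictionary, n_steps=3)
nonzeros = sum(1 for c in code if c != 0.0)
print(f"BoW error: {bow_error:.3f}  sparse error: {sc_error:.3f}  nonzeros: {nonzeros}")
```

With unit-norm atoms, even the first matching-pursuit step leaves a residual no larger than the nearest-word distance, because the coefficient is fitted rather than fixed at 1. That is the quantization-error argument of the abstract in its simplest form.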
Related resources
Feature extraction and representation for human action recognition
Human action recognition, as one of the most important topics in computer vision, has been extensively researched during the last decades; however, it is still regarded as a challenging task especially in realistic scenarios. The difficulties mainly result from the huge intra-class variation, background clutter, occlusions, illumination changes and noise. In this thesis, we aim to enhance human...
Efficient Local Feature Encoding for Human Action Recognition with Approximate Sparse Coding
Local spatio-temporal features are popular in the human action recognition task. In practice, they are usually coupled with a feature encoding approach, which helps to obtain the video-level vector representations that can be used in learning and recognition. In this paper, we present an efficient local feature encoding approach, which is called Approximate Sparse Coding (ASC). ASC computes the...
Face Recognition using an Affine Sparse Coding approach
Sparse coding is an unsupervised method that learns a set of over-complete bases to represent data such as images and video. It has attracted increasing interest for image classification applications in recent years. However, when similar images belong to different classes, as in face recognition applications, different images may be classified into the same class, and hen...
Action recognition via spatio-temporal local features: A comprehensive study
Local methods based on spatio-temporal interest points (STIPs) have shown their effectiveness for human action recognition. The bag-of-words (BoW) model has been widely used and has dominated this field. Recently, a large number of techniques based on local features including improved variants of the BoW model, sparse coding (SC), Fisher kernels (FK), vector of locally aggregated descriptors (VL...
Human Action Recognition Using LBP-TOP as Sparse Spatio-Temporal Feature Descriptor
In this paper we apply the Local Binary Pattern on Three Orthogonal Planes (LBP-TOP) descriptor to the field of human action recognition. A video sequence is described as a collection of spatial-temporal words after the detection of space-time interest points and the description of the area around them. Our contribution has been in the description part, showing LBP-TOP to be a promising descrip...